Wayback Machine URL Extractor - Archived URLs

Pricing

from $3.50 / 1,000 results

Extract every archived URL of any domain from the Internet Archive's Wayback Machine (CDX API). Recover lost or old pages, build redirect maps and run OSINT, with date and status filters. No API key, export to CSV or JSON.

Developer

Logiover

Maintained by Community

Wayback Machine URL Extractor 🕰️ — Archived URLs from the Internet Archive

Recover every historical URL a website has ever published — straight from the Internet Archive's Wayback Machine. This Wayback Machine scraper queries the public CDX API to extract archived URLs and historical URLs for any domain — including pages that were deleted, renamed, or lost in a migration. Feed in one domain and get back up to tens of thousands of unique URLs, each with its capture date, archived HTTP status, MIME type, and a direct Wayback snapshot link.

Point it at one domain and it pulls the full historical URL inventory automatically. No API key, no login, no rate-limit headaches — one row per archived URL.

Looking to recover old URLs after a site migration, build a redirect map, find old/deleted pages, do OSINT on a domain's history, or pull a list of Internet Archive URLs without writing CDX queries by hand? This is the Internet Archive URL extractor that does it at scale.


✨ Key features

  • 🕰️ Full historical URL inventory — pulls every unique URL the Wayback Machine has on record for a domain, going back to 1996.
  • 🔑 No API key required — uses the open Internet Archive CDX API; no auth, no token, no login.
  • 🌐 Subdomain & path matching — capture the host plus all subdomains and paths, or narrow down to a single host or path prefix.
  • 📅 Date-range filtering — restrict to snapshots captured between two dates (fromDate / toDate).
  • ✅ Status-code filtering: keep only captures with a chosen HTTP status (e.g. 200) and drop dead or redirected ones.
  • 🔗 Direct snapshot links — every row includes a ready-to-open web.archive.org/web/... URL.
  • 🌊 Streamed pagination — pages through massive result sets with the CDX resumeKey mechanism, so memory stays flat even on 100k+ URL domains.
  • 🔢 Result caps — set maxResults per domain, or 0 for unlimited.
  • 📋 Multiple domains per run — process a whole list in one go.
  • 📤 Export-ready — JSON, CSV, and Excel output via the Apify Dataset or REST API.

💡 Use cases

  • SEO migration & redirect maps — recover lost/old URLs after a site move and rebuild a complete 301 redirect map so you don't lose link equity.
  • Content recovery — find and restore blog posts, product pages, or docs that were deleted but still live in the archive.
  • OSINT & research — enumerate a target domain's historical footprint, old endpoints, removed pages, and forgotten subdomains.
  • Link reclamation — find old URLs that still earn backlinks so you can redirect them and reclaim the link value.
  • Finding old endpoints — surface admin paths, legacy APIs, and orphaned pages that no longer appear on the live site.
  • Competitive & web-archaeology research — reconstruct how a competitor's URL structure and content changed across years of snapshots.
  • Datasets — build a domain's URL/MIME/capture-history dataset for analysis.

📦 What you get

One row per unique archived URL, including:

| Field | Description |
| --- | --- |
| domain | The normalized domain this URL belongs to |
| url | The original archived URL |
| timestamp | Raw 14-digit Wayback capture timestamp (YYYYMMDDhhmmss) |
| capturedAt | ISO 8601 form of the capture timestamp |
| statusCode | HTTP status the archive recorded for that capture (e.g. 200, 301, 404, or -) |
| mimeType | Content type recorded at capture time (e.g. text/html) |
| digest | Wayback content digest (used internally for de-duplication) |
| snapshotUrl | Direct link to the archived snapshot on web.archive.org |

Example output

{
  "domain": "nasa.gov",
  "url": "http://www.nasa.gov/mission_pages/station/main/index.html",
  "timestamp": "20120114043915",
  "capturedAt": "2012-01-14T04:39:15.000Z",
  "statusCode": "200",
  "mimeType": "text/html",
  "digest": "AB23CD45EF67GH89IJ01KL23MN45OP67",
  "snapshotUrl": "https://web.archive.org/web/20120114043915/http://www.nasa.gov/mission_pages/station/main/index.html"
}
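
The capturedAt and snapshotUrl fields can both be derived from timestamp and url. The following is a minimal Python sketch of that derivation, not the actor's own code; the function names are hypothetical, and it assumes Wayback timestamps are in UTC.

```python
from datetime import datetime, timezone

WAYBACK_PREFIX = "https://web.archive.org/web"


def to_captured_at(timestamp: str) -> str:
    """Convert a 14-digit Wayback timestamp (YYYYMMDDhhmmss) to ISO 8601.

    Assumption: the timestamp is in UTC.
    """
    parsed = datetime.strptime(timestamp, "%Y%m%d%H%M%S")
    parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.strftime("%Y-%m-%dT%H:%M:%S.000Z")


def to_snapshot_url(timestamp: str, original_url: str) -> str:
    """Build the direct snapshot link: <prefix>/<timestamp>/<original-url>."""
    return f"{WAYBACK_PREFIX}/{timestamp}/{original_url}"


row = {
    "url": "http://www.nasa.gov/mission_pages/station/main/index.html",
    "timestamp": "20120114043915",
}
print(to_captured_at(row["timestamp"]))
print(to_snapshot_url(row["timestamp"], row["url"]))
```

Run against the example row above, this reproduces the capturedAt and snapshotUrl values shown in the example output.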

🚀 How to use it

  1. Click Try for free / Start.
  2. Add one or more domains to Domains (e.g. nasa.gov, bbc.com). Full URLs and a leading www. are normalized to the bare domain automatically.
  3. (Optional) Pick a matchType, set a date range, filter by status code, or raise maxResults (0 = unlimited).
  4. Click Save & Start.
  5. Export the archived URL list as JSON, CSV, Excel or via API, and open any row's snapshotUrl to view the archived page.

⚙️ Input

| Field | Type | Description | Default |
| --- | --- | --- | --- |
| domains | array | Required. One or more domains or URLs (e.g. nasa.gov, bbc.com). Wildcards are added automatically | |
| matchType | enum | subdomains (host + all subdomains + paths), host (exact host only), domain (host + subdomains), prefix (path prefix) | subdomains |
| fromDate | string | Optional YYYYMMDD lower bound on capture date | |
| toDate | string | Optional YYYYMMDD upper bound on capture date | |
| filterStatus | string | Optional. Only return captures with this HTTP status (e.g. 200) | |
| maxResults | integer | Max unique URLs per domain. 0 = unlimited | 5000 |
| proxyConfiguration | object | Proxy settings. Defaults to Apify Proxy | Apify Proxy |

Example input

{
  "domains": ["nasa.gov"],
  "matchType": "subdomains",
  "fromDate": "20100101",
  "toDate": "20201231",
  "filterStatus": "200",
  "maxResults": 5000,
  "proxyConfiguration": { "useApifyProxy": true }
}
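
If you assemble the input programmatically, it can help to validate it before starting a run. The helper below is a hypothetical sketch (build_run_input is not part of the actor or of any Apify library). It only enforces the constraints stated in the input table: the four matchType values, YYYYMMDD dates, and a non-negative maxResults.

```python
from datetime import datetime

# The four values listed for matchType in the input table.
MATCH_TYPES = {"subdomains", "host", "domain", "prefix"}


def _check_date(name, value):
    """Raise ValueError unless value is a real calendar date in YYYYMMDD form."""
    if len(value) != 8 or not value.isdigit():
        raise ValueError(f"{name} must be YYYYMMDD, got {value!r}")
    datetime.strptime(value, "%Y%m%d")  # rejects impossible dates like 20230231


def build_run_input(domains, match_type="subdomains", from_date=None,
                    to_date=None, filter_status=None, max_results=5000):
    """Return an input dict shaped like the example input above."""
    if not domains:
        raise ValueError("domains must contain at least one entry")
    if match_type not in MATCH_TYPES:
        raise ValueError(f"matchType must be one of {sorted(MATCH_TYPES)}")
    if max_results < 0:
        raise ValueError("maxResults must be 0 (unlimited) or a positive integer")

    run_input = {
        "domains": list(domains),
        "matchType": match_type,
        "maxResults": max_results,
    }
    if from_date is not None:
        _check_date("fromDate", from_date)
        run_input["fromDate"] = from_date
    if to_date is not None:
        _check_date("toDate", to_date)
        run_input["toDate"] = to_date
    if from_date and to_date and from_date > to_date:
        raise ValueError("fromDate must not be later than toDate")
    if filter_status is not None:
        run_input["filterStatus"] = str(filter_status)
    return run_input


print(build_run_input(["nasa.gov"], from_date="20100101",
                      to_date="20201231", filter_status="200"))
```

Omitting proxyConfiguration here relies on the documented default (Apify Proxy).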

🔍 How it works

  1. Each domain you provide is normalized — scheme, www., paths and wildcards are stripped down to a bare host.
  2. A CDX API query is built from your matchType, date range, and status filter, requesting the original, timestamp, statuscode, mimetype and digest fields with collapse=urlkey so each URL appears only once instead of returning every capture of it.
  3. Results are paged using the CDX showResumeKey / resumeKey mechanism, and each page is pushed to the dataset in a batch — so even domains with hundreds of thousands of archived URLs stream out without exhausting memory.
  4. For every row, a direct snapshotUrl is constructed in the https://web.archive.org/web/<timestamp>/<original-url> form, so you can open the exact archived page.
  5. Slow responses, 5xx, and 429 errors are retried with exponential backoff on a fresh proxy IP — the CDX index can be slow, so retries keep large runs reliable.
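
Steps 2 and 3 above can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the actor's source. The function names are hypothetical. The match_type argument takes the CDX API's own matchType values (exact, prefix, host, domain), because how the actor maps its input values onto them is not documented here. The parser assumes the plain-text CDX output, where showResumeKey=true appends a blank line and then the resume key. Network access is injected through a fetch callable, so retries and proxies stay out of the sketch.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"
FIELDS = ["original", "timestamp", "statuscode", "mimetype", "digest"]


def build_cdx_url(host, match_type="domain", from_date=None, to_date=None,
                  status=None, page_size=1000, resume_key=None):
    """Build one CDX query URL; collapse=urlkey keeps one row per URL."""
    params = [
        ("url", host),
        ("matchType", match_type),  # CDX values: exact, prefix, host, domain
        ("fl", ",".join(FIELDS)),
        ("collapse", "urlkey"),
        ("limit", str(page_size)),
        ("showResumeKey", "true"),
    ]
    if from_date:
        params.append(("from", from_date))
    if to_date:
        params.append(("to", to_date))
    if status:
        params.append(("filter", f"statuscode:{status}"))
    if resume_key:
        params.append(("resumeKey", resume_key))
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def parse_cdx_page(body):
    """Split a plain-text CDX page into (rows, resume_key or None)."""
    lines = body.rstrip("\n").split("\n")
    resume_key = None
    # Assumed layout: records, then a blank line, then the resume key.
    if len(lines) >= 2 and lines[-2] == "":
        resume_key = lines[-1].strip()
        lines = lines[:-2]
    rows = []
    for line in lines:
        parts = line.split(" ")
        if len(parts) == len(FIELDS):
            rows.append(dict(zip(FIELDS, parts)))
    return rows, resume_key


def iter_rows(fetch, host, **options):
    """Yield rows page by page; fetch is any callable mapping URL -> body text."""
    resume_key = None
    while True:
        url = build_cdx_url(host, resume_key=resume_key, **options)
        rows, resume_key = parse_cdx_page(fetch(url))
        yield from rows
        if not resume_key:
            break
```

Because iter_rows is a generator, only one page is held in memory at a time, which is the property the resumeKey paging is meant to provide. A real fetch callable would wrap an HTTP client and add the retry and backoff behaviour described in step 5.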

🧰 Tips & best practices

  • Big domains (news sites, government sites) can have hundreds of thousands of archived URLs. Start with the default maxResults of 5000 to gauge volume, then raise it or set 0 for everything.
  • Use filterStatus: "200" to skip dead and redirected captures and keep only pages that actually resolved — ideal for building redirect maps.
  • Narrow with fromDate / toDate (both YYYYMMDD) when you only care about a specific era of the site.
  • Use matchType: "subdomains" to sweep every subdomain at once, or host for a single host without its subdomains.
  • Sort or filter the dataset by mimeType to isolate just HTML pages, images, PDFs, etc.
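
As a small illustration of the last tip, the sketch below filters exported rows by mimeType and counts content types. It assumes the rows were loaded from the JSON export into a list of dicts with the field names from the output table; the helper names are hypothetical.

```python
from collections import Counter


def html_pages(rows):
    """Keep only rows archived as HTML with a recorded 200 status."""
    return [
        row for row in rows
        if row.get("mimeType") == "text/html" and row.get("statusCode") == "200"
    ]


def mime_breakdown(rows):
    """Count rows per content type to see what the archive holds."""
    return Counter(row.get("mimeType", "unknown") for row in rows)


# Illustrative rows only, shaped like the example output above.
sample = [
    {"url": "http://example.com/", "mimeType": "text/html", "statusCode": "200"},
    {"url": "http://example.com/old", "mimeType": "text/html", "statusCode": "301"},
    {"url": "http://example.com/logo.png", "mimeType": "image/png", "statusCode": "200"},
]
print([row["url"] for row in html_pages(sample)])
print(mime_breakdown(sample))
```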

❓ FAQ

How do I get all URLs of a website from the Wayback Machine?

Add the domain to Domains, leave matchType on subdomains, set maxResults to 0 for everything, and run it. The actor queries the Internet Archive CDX API and returns one row per unique archived URL.

Can I find old or deleted pages of a domain?

Yes — that's the core use case. The Wayback Machine keeps URLs even after they're removed from the live site, so deleted blog posts, retired product pages, and old endpoints all show up in the results with a snapshotUrl to view them.

How do I export archived URLs to CSV or JSON?

Run the actor, then download the dataset as CSV, JSON or Excel (or pull it via the REST API). Every archived URL is one row, so it drops straight into a spreadsheet or pipeline.

Is this free and without an API key?

The Internet Archive CDX API is public and requires no API key and no login, so the only cost is running the actor on Apify (listed from $3.50 per 1,000 results).

Can I filter by date or status code?

Yes — set fromDate / toDate (YYYYMMDD) to restrict to a capture window, and filterStatus (e.g. 200) to keep only captures with a specific HTTP status.

How many URLs can it return?

The default cap is 5,000 unique URLs per domain; raise maxResults, or set it to 0 for unlimited. Results stream to the dataset in pages via the CDX resumeKey, so even 100k+ URL domains run without memory issues.

Why are some statusCode values -?

The Wayback index sometimes records captures without a stored status code (e.g. revisit records). Those rows are still valid archived URLs.

📝 Changelog

2026-06-15

  • Initial release — extract archived URLs from the Wayback Machine CDX API with date/status filters, CSV/JSON export, no API key.